Abstract

Ordinal regression is an important type of 
learning, which has properties of both clas- 
sification and regression. Here we describe 
a simple and effective approach to adapt a 
traditional neural network to learn ordinal 
categories. Our approach is a generaliza- 
tion of the perceptron method for ordinal 
regression. On several benchmark datasets, 
our method (NNRank) outperforms a neural 
network classification method. Compared 
with the ordinal regression methods using 
Gaussian processes and support vector 
machines, NNRank achieves comparable 
performance. Moreover, NNRank has the 
advantages of traditional neural networks: 
learning in both online and batch modes, 
handling very large training datasets, and 
making rapid predictions. These features 
make NNRank a useful and complementary 
tool for large-scale data processing tasks 
such as information retrieval, web page 
ranking, collaborative filtering, and protein 
ranking in Bioinformatics. 



1. Introduction 

Ordinal regression (or ranking learning) is an impor- 
tant supervised problem of learning a ranking or or- 
dering on instances, which has the property of both 
classification and metric regression. The learning task 
of ordinal regression is to assign data points into a set 
of finite ordered categories. For example, a teacher 
rates students' performance using A, B, C, D, and E 
(A>B>C>D>E) (Chu & Ghahramani, 2005a). 
Ordinal regression is different from classification due 
to the order of categories. In contrast to metric re- 
gression, the response variables (categories) in ordinal 
regression are discrete and finite.



Research on ordinal regression dates back to the
ordinal statistics methods of the 1980s (McCullagh, 1980;
McCullagh & Nelder, 1983) and to machine learning
research in the 1990s (Caruana et al., 1996; Herbrich et al.,
1998; Cohen et al., 1999). It has attracted
considerable attention in recent years due to its potential
applications in many data-intensive domains such
as information retrieval (Herbrich et al., 1998), web 
page ranking (Joachims, 2002), collaborative filtering 
(Goldberg et al., 1992; Basilico & Hofmann, 2004; Yu 
et al., 2006), image retrieval (Wu et al., 2003), and pro- 
tein ranking (Cheng & Baldi, 2006) in Bioinformatics. 

A number of machine learning methods have been
developed or redesigned to address the ordinal regression
problem (Rajaram et al., 2003), including the perceptron
(Crammer & Singer, 2002) and its kernelized
generalization (Basilico & Hofmann, 2004), neural networks
with gradient descent (Caruana et al., 1996; Burges
et al., 2005), Gaussian processes (Chu & Ghahramani,
2005b; Chu & Ghahramani, 2005a; Schwaighofer et
al., 2005), large margin classifiers (or support vector
machines) (Herbrich et al., 1999; Herbrich et al.,
2000; Joachims, 2002; Shashua & Levin, 2003; Chu
& Keerthi, 2005; Aiolli & Sperduti, 2004; Chu &
Keerthi, 2007), the k-partite classifier (Agarwal & Roth,
2005), boosting algorithms (Freund et al., 2003; Dekel
et al., 2002), constraint classification (Har-Peled et al.,
2002), regression trees (Kramer et al., 2001), Naive
Bayes (Zhang et al., 2005), Bayesian hierarchical
experts (Paquet et al., 2005), the binary classification
approach (Frank & Hall, 2001; Li & Lin, 2006) that
decomposes the original ordinal regression problem into
a set of binary classifications, and the optimization of
nonsmooth cost functions (Burges et al., 2006).

Most of these methods can be roughly classified into
two categories: the pairwise constraint approach (Herbrich
et al., 2000; Joachims, 2002; Dekel et al., 2004; Burges
et al., 2005) and the multi-threshold approach
(Crammer & Singer, 2002; Shashua & Levin, 2003; Chu &
Ghahramani, 2005a). The former converts the full
ranking relation into pairwise order constraints. The
latter tries to learn multiple thresholds to divide data
into ordinal categories. Multi-threshold approaches
can also be unified under the general, extended binary
classification framework (Li & Lin, 2006).

The ordinal regression methods have different
advantages and disadvantages. Prank (Crammer & Singer,
2002), a perceptron approach that generalizes the
binary perceptron algorithm to the ordinal multi-class
situation, is a fast online algorithm. However, like a
standard perceptron method, its accuracy suffers when
dealing with non-linear data, although a quadratic kernel
version of Prank greatly relieves this problem. One
class of accurate large-margin classifier approaches
(Herbrich et al., 2000; Joachims, 2002) converts the
ordinal relations into $O(n^2)$ (n: the number of data
points) pairwise ranking constraints for structural
risk minimization (Vapnik, 1995; Schoelkopf & Smola,
2002). Thus, it cannot be applied to medium-sized
datasets (> 10,000 data points) without discarding
some pairwise preference relations. It may also overfit
noise due to incomparable pairs.

The other class of powerful large-margin classifier
methods (Shashua & Levin, 2003; Chu & Keerthi,
2005) generalizes the support vector formulation for
ordinal regression by finding $K - 1$ thresholds on the
real line that divide data into K ordered categories.
The size of this optimization problem is linear in the
number of training examples. However, as with support
vector machines used for classification, prediction
is slow when the solution is not sparse, which
makes the approach inappropriate for time-critical tasks.
Similarly, another state-of-the-art approach, the Gaussian
process method (Chu & Ghahramani, 2005a), also has
difficulty handling large training datasets and suffers
from slow prediction speed in some situations.

Here we describe a new neural network approach for
ordinal regression that has the advantages of neural
network learning: learning in both online and batch
modes, training on very large datasets (Burges et al.,
2005), handling non-linear data, good performance,
and rapid prediction. Our method can be considered
a generalization of perceptron learning (Crammer
& Singer, 2002) to multi-layer perceptrons (neural
networks) for ordinal regression. Our method is also
related to the classic generalized linear models (e.g.,
the cumulative logit model) for ordinal regression
(McCullagh, 1980). Unlike the neural network method
(Burges et al., 2005) trained on pairs of examples
to learn pairwise order relations, our method works
on individual data points and uses multiple output
nodes to estimate the probabilities of ordinal
categories. Thus, our method falls into the category of
multi-threshold approaches. Learning in our method
proceeds as in traditional neural networks, using
back-propagation (Rumelhart et al., 1986).

On the same benchmark datasets, our method yields
better performance than the standard classification
neural networks and performance comparable to the
state-of-the-art methods using support vector machines and
Gaussian processes. In addition, our method can learn
on very large datasets and make rapid predictions.

2. Method 
2.1. Formulation 

Let D represent an ordinal regression dataset
consisting of n data points (x, y), where $x \in R^d$ is an input
feature vector and y is its ordinal category from a
finite set Y. Without loss of generality, we assume that
$Y = \{1, 2, ..., K\}$ with "<" as order relation.

For a standard classification neural network without
considering the order of categories, the goal is to
predict the probability of a data point x belonging to
one category k (y = k). The input is x and the
target of encoding the category k is a vector t =
(0, ..., 0, 1, 0, ..., 0), where only the element $t_k$ is set to
1 and all others to 0. The goal is to learn a function
to map input vector x to a probability distribution
vector $o = (o_1, o_2, ..., o_k, ..., o_K)$, where $o_k$ is closer to 1
and other elements are close to zero, subject to the
constraint $\sum_{i=1}^{K} o_i = 1$.

In contrast, like the perceptron approach (Crammer &
Singer, 2002), our neural network approach considers
the order of the categories. If a data point x belongs
to category k, it is classified automatically into
lower-order categories (1, 2, ..., k - 1) as well. So the target
vector of x is t = (1, 1, ..., 1, 0, 0, 0), where $t_i$ ($1 \le i \le k$)
is set to 1 and the other elements to zero. Thus, the goal
is to learn a function to map the input vector x to
a probability vector $o = (o_1, o_2, ..., o_k, ..., o_K)$, where
$o_i$ ($i \le k$) is close to 1 and $o_i$ ($i > k$) is close to 0.
$\sum_{i=1}^{K} o_i$ is the estimate of the number of categories (i.e.,
k) that x belongs to, instead of 1. The formulation
of the target vector is similar to the perceptron
approach (Crammer & Singer, 2002). It is also related
to the classical cumulative probit model for ordinal
regression (McCullagh, 1980), in the sense that we can
consider the output probability vector $(o_1, ..., o_k, ..., o_K)$
as a cumulative probability distribution on categories
(1, ..., k, ..., K), i.e., $\frac{1}{K} \sum_{i=1}^{K} o_i$ is the proportion of
categories that x belongs to, starting from category 1.

The target encoding scheme of our method is related to,
but different from, multi-label learning (Bishop, 1996)
and multiple label learning (Jin & Ghahramani, 2003)
because our method imposes an order on the labels (or
categories).
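
The two target encodings described in this section can be contrasted in a few lines of Python. This is only an illustrative sketch: the function names `one_hot_target` and `ordinal_target` are hypothetical and do not come from the NNRank implementation.

```python
def one_hot_target(k, num_categories):
    """Standard classification target: only position k (1-based) is 1."""
    return [1 if i == k else 0 for i in range(1, num_categories + 1)]


def ordinal_target(k, num_categories):
    """Ordinal target: the first k positions are 1, the rest are 0."""
    if not 1 <= k <= num_categories:
        raise ValueError("category k must lie in 1..K")
    return [1 if i <= k else 0 for i in range(1, num_categories + 1)]


K = 5
print(one_hot_target(3, K))        # [0, 0, 1, 0, 0]
print(ordinal_target(3, K))        # [1, 1, 1, 0, 0]
print(sum(ordinal_target(3, K)))   # 3, i.e. the category index k
```

Summing the ordinal target recovers k, which mirrors the statement that the sum of the outputs estimates the number of categories a point belongs to.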

2.2. Learning 

Under this formulation, we can use almost exactly the
same neural network machinery for ordinal regression.
We construct a multi-layer neural network to learn 
ordinal relations from D. The neural network has d 
inputs corresponding to the number of dimensions of 
input feature vector x and K output nodes correspond- 
ing to K ordinal categories. There can be one or more 
hidden layers. Without loss of generality we use one 
hidden layer to construct a standard two-layer feedfor- 
ward neural network. Like a standard neural network 
for classification, input nodes are fully connected with 
hidden nodes, which in turn are fully connected with 
output nodes. Likewise, the transfer function of
hidden nodes can be a linear function, a sigmoid function,
or the tanh function, which is used in our experiments. The
only difference from a traditional neural network lies in
the output layer. Traditional neural networks use the
softmax function $e^{z_i} / \sum_{j=1}^{K} e^{z_j}$ (or normalized
exponential function) for
output nodes, satisfying the constraint that the sum of
outputs $\sum_{i=1}^{K} o_i$ is 1. Here $z_i$ is the net input to the output
node $O_i$.

In contrast, each output node $O_i$ of our neural
network uses a standard sigmoid function $1 / (1 + e^{-z_i})$,
without including the outputs from other nodes. Output
node $O_i$ is used to estimate the probability $o_i$ that a
data point belongs to category i independently,
without being subjected to normalization as in traditional
neural networks. Thus, for a data point x of category
k, the target vector is (1, 1, ..., 1, 0, 0, 0), in which the
first k elements are 1 and the others 0. This sets the target
value of output nodes $O_i$ ($i \le k$) to 1 and $O_i$ ($i > k$)
to 0. The targets instruct the neural network to
adjust weights to produce probability outputs as close
as possible to the target vector. It is worth pointing
out that using independent sigmoid functions for
output nodes does not guarantee the monotonic relation
($o_1 \ge o_2 \ge ... \ge o_K$), which is not necessary but
desirable for making predictions (Li & Lin, 2006). A
more sophisticated approach is to impose the
inequality constraints on the outputs to improve the
performance.
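
To make the difference in the output layer concrete, the following minimal sketch runs one forward pass through a one-hidden-layer network with tanh hidden units, once with independent sigmoid outputs and once with softmax outputs. The layer sizes, the random weights, and the names `forward`, `w_hidden`, and `w_out` are assumptions chosen for illustration; they are not taken from the authors' C++ code.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w_hidden, w_out, ordinal=True):
    """One-hidden-layer forward pass; each weight row ends with a bias term."""
    xb = list(x) + [1.0]
    hidden = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]
    net = [sum(w * v for w, v in zip(row, hb)) for row in w_out]
    if ordinal:
        return [sigmoid(z) for z in net]  # independent sigmoid per output node
    return softmax(net)                   # normalized outputs for classification


rng = random.Random(0)
d, h, K = 4, 3, 5  # hypothetical sizes: inputs, hidden units, categories
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(h)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(h + 1)] for _ in range(K)]
x = [0.2, -1.0, 0.7, 0.1]

o_rank = forward(x, w_hidden, w_out, ordinal=True)
o_class = forward(x, w_hidden, w_out, ordinal=False)
print(o_rank)        # K values in (0, 1); their sum is generally not 1
print(sum(o_class))  # 1 up to floating-point rounding
```

Only the last step of `forward` differs between the two modes, which matches the observation that the output layer is the sole architectural change.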

Training of the neural network for ordinal
regression proceeds much as for standard neural
networks. The cost function for a data point x can
be relative entropy or square error between the
target vector and the output vector. For relative
entropy, the cost function for output nodes is
$f_c = \sum_{i=1}^{K} (t_i \log o_i + (1 - t_i) \log(1 - o_i))$. For square
error, the error function is $f_c = \sum_{i=1}^{K} (t_i - o_i)^2$.
Previous studies (Richard & Lippman, 1991) on neural
network cost functions show that relative entropy and
square error functions usually yield very similar
results. In our experiments, we use the square error function
and standard back-propagation to train the neural
network. The errors are propagated back to output nodes,
and from output nodes to hidden nodes, and finally to
input nodes.

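Since the transfer function $f_t$ of output node $O_i$ is
the independent sigmoid function $1 / (1 + e^{-z_i})$, the
derivative of $f_t$ of output node $O_i$ is
$\frac{\partial f_t}{\partial z_i} = \frac{e^{-z_i}}{(1 + e^{-z_i})^2} = \frac{1}{1 + e^{-z_i}} \left(1 - \frac{1}{1 + e^{-z_i}}\right) = o_i (1 - o_i)$.
Thus, the net error propagated to output node $O_i$ is
$\frac{\partial f_c}{\partial o_i} \frac{\partial f_t}{\partial z_i} = \frac{t_i - o_i}{o_i (1 - o_i)} \times o_i (1 - o_i) = t_i - o_i$
for the relative entropy cost function, and
$\frac{\partial f_c}{\partial o_i} \frac{\partial f_t}{\partial z_i} = -2 (t_i - o_i) \times o_i (1 - o_i) = -2 o_i (t_i - o_i)(1 - o_i)$
for the square error cost function. The net errors are
propagated through neural networks to adjust weights
using gradient descent as traditional neural networks do.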
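
The two closed-form net errors can be checked numerically. The sketch below compares them with central finite differences of the cost functions with respect to the net input $z_i$ of a single output node; the helper names and the test values of t and z are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relative_entropy(t, z):
    o = sigmoid(z)
    return t * math.log(o) + (1 - t) * math.log(1 - o)


def square_error(t, z):
    o = sigmoid(z)
    return (t - o) ** 2


def numeric_derivative(f, t, z, eps=1e-6):
    """Central finite difference of f with respect to the net input z."""
    return (f(t, z + eps) - f(t, z - eps)) / (2 * eps)


t, z = 1.0, 0.3  # arbitrary target and net input for the check
o = sigmoid(z)

delta_entropy = t - o                       # closed form for relative entropy
delta_square = -2 * o * (t - o) * (1 - o)   # closed form for square error

print(abs(delta_entropy - numeric_derivative(relative_entropy, t, z)) < 1e-6)  # True
print(abs(delta_square - numeric_derivative(square_error, t, z)) < 1e-6)       # True
```

Both comparisons agree to within the finite-difference tolerance, which confirms that the sigmoid derivative $o_i (1 - o_i)$ cancels in the relative entropy case and remains as a factor in the square error case.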

Despite the small difference in the transfer function 
and the computation of its derivative, the training of 
our method is the same as traditional neural networks. 
The network can be trained on data in the online 
mode where weights are updated per example, or in 
the batch mode where weights are updated per bunch 
of examples. 
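
The online and batch modes differ only in how many examples contribute to each weight update. The following pure-Python sketch trains a small tanh/sigmoid network with the square error cost on hypothetical one-dimensional toy data with three ordered categories. The data, network sizes, learning rate, and the choice to step along the mean gradient of a batch are all assumptions made for illustration, not details of NNRank.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hid, w_out):
    """Return (input + bias, hidden activations, outputs); bias is the last weight."""
    xb = list(x) + [1.0]
    hid = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
    out = [sigmoid(sum(w * v for w, v in zip(row, hid + [1.0]))) for row in w_out]
    return xb, hid, out


def gradients(x, t, w_hid, w_out):
    """Square-error gradients for one example via back-propagation."""
    xb, hid, out = forward(x, w_hid, w_out)
    # Net error at each output node: -2 * o * (t - o) * (1 - o).
    d_out = [-2.0 * o * (ti - o) * (1.0 - o) for o, ti in zip(out, t)]
    # Propagate to tanh hidden units, whose derivative is 1 - h^2.
    d_hid = [(1.0 - h * h) * sum(d * row[j] for d, row in zip(d_out, w_out))
             for j, h in enumerate(hid)]
    g_out = [[d * v for v in hid + [1.0]] for d in d_out]
    g_hid = [[d * v for v in xb] for d in d_hid]
    return g_hid, g_out


def add_scaled(target, source, scale):
    """In place: target += scale * source, element by element."""
    for row_t, row_s in zip(target, source):
        for j, s in enumerate(row_s):
            row_t[j] += scale * s


def train_epoch(data, w_hid, w_out, lr, batch_size):
    """batch_size=1 is online mode; batch_size=len(data) is full-batch mode."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        acc_hid = [[0.0] * len(row) for row in w_hid]
        acc_out = [[0.0] * len(row) for row in w_out]
        for x, t in batch:
            g_hid, g_out = gradients(x, t, w_hid, w_out)
            add_scaled(acc_hid, g_hid, 1.0)
            add_scaled(acc_out, g_out, 1.0)
        # Gradient descent step along the mean gradient of the batch.
        add_scaled(w_hid, acc_hid, -lr / len(batch))
        add_scaled(w_out, acc_out, -lr / len(batch))


def total_error(data, w_hid, w_out):
    return sum(sum((ti - o) ** 2 for ti, o in zip(t, forward(x, w_hid, w_out)[2]))
               for x, t in data)


def make_toy_data(n, rng):
    """Hypothetical 1-D toy data with K = 3 ordered categories."""
    data = []
    for _ in range(n):
        v = rng.random()
        k = 1 + int(v >= 1 / 3) + int(v >= 2 / 3)
        data.append(([v], [1.0 if i <= k else 0.0 for i in range(1, 4)]))
    return data


def run(batch_size, epochs=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    data = make_toy_data(30, rng)
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(5)] for _ in range(3)]
    before = total_error(data, w_hid, w_out)
    for _ in range(epochs):
        train_epoch(data, w_hid, w_out, lr, batch_size)
    return before, total_error(data, w_hid, w_out)


print(run(batch_size=1))    # online mode: (error before, error after)
print(run(batch_size=30))   # batch mode: (error before, error after)
```

In both modes the training error after 200 epochs should be lower than the starting error; online mode performs 30 updates per epoch here, while batch mode performs one.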

2.3. Prediction 

In the test phase, to make a prediction, our method
scans output nodes in the order $O_1, O_2, ..., O_K$. It
stops when the output of a node is smaller than the
predefined threshold T (e.g., 0.5) or no nodes are left. The
index k of the last node $O_k$ whose output is bigger than
T is the predicted category of the data point.
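
The scan can be sketched as below. The function name `predict_category` is hypothetical, and two details are assumptions because the text does not specify them: an output exactly equal to T is treated as passing the threshold, and a data point whose first output already falls below T is assigned the lowest category.

```python
def predict_category(outputs, threshold=0.5):
    """Scan o_1, o_2, ..., o_K and stop at the first output below the threshold."""
    k = 0
    for o in outputs:
        if o < threshold:
            break
        k += 1
    # Assumption (not specified in the text): if even o_1 falls below the
    # threshold, fall back to the lowest category instead of returning 0.
    return max(k, 1)


print(predict_category([0.97, 0.88, 0.61, 0.12, 0.03]))  # 3
print(predict_category([0.97, 0.40, 0.61, 0.12, 0.03]))  # 1 (scan stops at o_2)
print(predict_category([0.90, 0.90, 0.90, 0.90, 0.90]))  # 5 (no nodes left)
```

The second call shows why the monotonic relation among outputs is desirable: a non-monotonic output vector stops the scan early even though a later node exceeds the threshold.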

3. Experiments and Results 

3.1. Benchmark Data and Evaluation Metric 

We use eight standard datasets for ordinal
regression (Chu & Ghahramani, 2005a) to benchmark our
method. The eight datasets (Diabetes, Pyrimidines,
Triazines, Machine CPU, Auto MPG, Boston, Stocks
Domain, and Abalone) were originally used for metric
regression. Chu and Ghahramani (Chu &
Ghahramani, 2005a) discretized the real-value targets into
five equal intervals, corresponding to five ordinal
categories. The authors randomly split each dataset into
training/test datasets and repeated the partition 20
times independently. We use exactly the same
partitions as in (Chu & Ghahramani, 2005a) to train and
test our method.

We use the online mode to train neural networks. The
parameters to tune are the number of hidden units, the
number of epochs, and the learning rate. We create
a grid for these three parameters, where the hidden
unit number is in the range [1..15], the epoch number
in the set (50, 200, 500, 1000), and the initial learning
rate in the range [0.01..0.5]. During training, the
learning rate is halved if training errors continuously
go up for a pre-defined number (40, 60, 80, or 100) of
epochs. For experiments on each data split, the neural
network parameters are fully optimized on the training
data without using any test data.
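
The parameter grid and the learning-rate rule can be sketched as follows. The hidden-unit range and the epoch set come from the text; the specific learning-rate sample points, the class name `HalvingSchedule`, and the reading of "continuously go up" as consecutive epoch-over-epoch increases are assumptions made for illustration.

```python
import itertools

hidden_units = range(1, 16)               # [1..15], from the text
epoch_counts = (50, 200, 500, 1000)       # from the text
learning_rates = (0.01, 0.05, 0.1, 0.5)   # assumed sample points in [0.01, 0.5]

grid = list(itertools.product(hidden_units, epoch_counts, learning_rates))
print(len(grid))  # 240 combinations for this assumed grid


class HalvingSchedule:
    """Halve the rate after `patience` consecutive epochs of rising training error."""

    def __init__(self, initial_rate, patience):
        self.rate = initial_rate
        self.patience = patience
        self.rising = 0
        self.previous = None

    def step(self, training_error):
        if self.previous is not None and training_error > self.previous:
            self.rising += 1
        else:
            self.rising = 0
        self.previous = training_error
        if self.rising >= self.patience:
            self.rate /= 2.0
            self.rising = 0
        return self.rate


# A small patience keeps the demonstration short; the text uses 40 to 100 epochs.
schedule = HalvingSchedule(initial_rate=0.1, patience=3)
rates = [schedule.step(err) for err in (1.0, 0.9, 1.0, 1.1, 1.2, 1.1)]
print(rates)  # [0.1, 0.1, 0.1, 0.1, 0.05, 0.05]
```

The rate drops once the error has risen for three epochs in a row and then stays at the halved value until another such run occurs.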

For each experiment, after the parameters are
optimized on the training data, we train five models on
the training data with the optimal parameters,
starting from different initial weights. The ensemble of five
trained models is then used to estimate the
generalization performance on the test data. That is, the average
output of the five neural network models is used to make
predictions.
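
Averaging the outputs of several models before thresholding can be sketched as below. The five output vectors are made-up numbers for a single test point with K = 5, and the threshold scan is written inline with `itertools.takewhile`; none of this is taken from the authors' code.

```python
import itertools


def average_outputs(model_outputs):
    """Average the K-dimensional output vectors of several models element-wise."""
    n = len(model_outputs)
    return [sum(column) / n for column in zip(*model_outputs)]


# Hypothetical outputs of five trained models for one test point (K = 5).
five_models = [
    [0.95, 0.80, 0.55, 0.20, 0.05],
    [0.90, 0.85, 0.45, 0.10, 0.02],
    [0.97, 0.70, 0.60, 0.15, 0.04],
    [0.92, 0.75, 0.40, 0.25, 0.03],
    [0.96, 0.90, 0.65, 0.05, 0.01],
]

avg = average_outputs(five_models)
print([round(o, 2) for o in avg])  # [0.94, 0.8, 0.53, 0.15, 0.03]

# Count leading outputs at or above the threshold T = 0.5.
predicted = len(list(itertools.takewhile(lambda o: o >= 0.5, avg)))
print(predicted)  # 3
```

Two of the five models place the third output below 0.5 on their own, but the averaged output stays above the threshold, so the ensemble predicts category 3.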

We evaluate our method using zero-one error and mean
absolute error as in (Chu & Ghahramani, 2005a).
Zero-one error is the percentage of wrong assignments
of ordinal categories. Mean absolute error is the
average absolute difference between assigned categories
(k') and true categories (k) over all data points. For
each dataset, the training and evaluation process is
repeated 20 times on 20 data splits. Thus, we
compute the average error and the standard deviation of
the two metrics as in (Chu & Ghahramani, 2005a).
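
The two metrics and their aggregation over data splits can be sketched as follows, assuming mean absolute error carries its usual meaning, the average of |k' - k|. The example labels and per-split errors are made up, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics


def zero_one_error(predicted, actual):
    """Fraction of data points whose predicted category is wrong."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def mean_absolute_error(predicted, actual):
    """Average of |k' - k| over all data points."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


predicted = [1, 2, 3, 1, 5]
actual = [1, 2, 2, 4, 5]
print(zero_one_error(predicted, actual))       # 0.4
print(mean_absolute_error(predicted, actual))  # 0.8

# Aggregating over data splits (hypothetical per-split zero-one errors).
split_errors = [0.25, 0.5, 0.75]
print(statistics.mean(split_errors))   # 0.5
print(statistics.stdev(split_errors))  # 0.25 (sample standard deviation)
```

The fourth example is wrong by three categories, so it counts once in the zero-one error but contributes three units to the absolute error, which is how the second metric reflects the order of the categories.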

3.2. Comparison with Neural Network 
Classification 

We first compare our method (NNRank) with a
standard neural network classification method (NNClass).
We implement both NNRank and NNClass in
C++. NNRank and NNClass share most of their code, with
minor differences in the transfer function of output
nodes and its derivative computation as described in
Section 2.

As Table 1 shows, NNRank outperforms NNClass in
all but one case in terms of both the mean zero-one error
and the mean absolute error. On some datasets
the improvement of NNRank over NNClass is sizable.
For instance, on the Stocks and Pyrimidines datasets,
the mean zero-one error of NNRank is about 4% less
than that of NNClass; on four datasets (Stocks, Pyrimidines,
Triazines, and Diabetes) the mean absolute error is
reduced by about 0.05. The results show that the
ordinal regression neural network consistently achieves
better performance than the standard classification
neural network. To further verify the effectiveness
of the neural network ordinal regression approach, we
are currently evaluating NNRank and NNClass on very
large ordinal regression datasets in the bioinformatics
domain (work in progress).

3.3. Comparison with Gaussian Processes and 
Support Vector Machines 

To further evaluate the performance of our method, we
compare NNRank with two Gaussian process
methods (GP-MAP and GP-EP) (Chu & Ghahramani,
2005a) and a support vector machine method (SVM)
(Shashua & Levin, 2003) implemented in (Chu &
Ghahramani, 2005a). The results of the three
methods are quoted from (Chu & Ghahramani, 2005a).
Table 2 reports the zero-one error on the eight datasets.
NNRank achieves the best results on Diabetes,
Triazines, and Abalone, GP-EP on Pyrimidines, Auto
MPG, and Boston, GP-MAP on Machine, and SVM
on Stocks.

Table 3 reports the mean absolute error on the eight
datasets. NNRank yields the best results on Diabetes
and Abalone, GP-EP on Pyrimidines, Auto MPG, and
Boston, GP-MAP on Triazines and Machine, and SVM on
Stocks.

In summary, on the eight datasets, the performance 
of NNRank is comparable to the three state-of-the-art 
methods for ordinal regression. 

4. Discussion and Future Work 

We have described a simple yet novel approach to
adapt traditional neural networks for ordinal
regression. Our neural network approach can be
considered a generalization of the one-layer perceptron approach
(Crammer & Singer, 2002) to multiple layers. On the
standard benchmark of ordinal regression, our method
outperforms standard neural networks used for
classification. Furthermore, on the same benchmark, our
method achieves performance similar to that of the two
state-of-the-art methods (support vector machines and
Gaussian processes) for ordinal regression.

Compared with existing methods for ordinal
regression, our method has several advantages of neural
networks. First, like the perceptron approach (Crammer
& Singer, 2002), our method can learn in both batch
and online modes. The online learning ability makes
our method a good tool for adaptive learning in
real time. The multi-layer structure of the neural network
and the non-linear transfer function give our method
stronger fitting ability than perceptron methods.

Second, the neural network can be trained on very
large datasets iteratively, while training is more
complex than support vector machines and Gaussian
processes. Since the training process of our method is the
same as traditional neural networks, average neural
network users can use this method for their tasks.

Third, the neural network method can make rapid
predictions once models are trained. The ability to
learn on very large datasets and to predict
quickly makes our method a useful and competitive
tool for ordinal regression tasks, particularly for
time-critical and large-scale ranking problems in
information retrieval, web page ranking, collaborative
filtering, and the emerging fields of Bioinformatics.
We are currently applying the method to
rank proteins according to their structural
relevance with respect to a query protein (Cheng &
Baldi, 2006). To facilitate the application of this
new approach, both NNRank and NNClass
accept a general input format and are freely available at
http://www.eecs.ucf.edu/~jcheng/chengjioftware.html

There are several directions for further improving the
neural network (or multi-layer perceptron) approach for
ordinal regression. One direction is to design a
transfer function that ensures the monotonic decrease of the
outputs of the neural network; another direction
is to derive general error bounds for the method
under the binary classification framework (Li & Lin,
2006). Furthermore, other flavors of
implementation of the multi-threshold multi-layer perceptron
approach for ordinal regression are possible. Since
machine learning ranking is a fundamental problem that
has wide applications in many diverse domains such
as web page ranking, information retrieval, image
retrieval, collaborative filtering, bioinformatics and so
on, we believe that further exploration of the neural
network (or multi-layer perceptron) approach for ranking
and ordinal regression is worthwhile.
Table 1. The results of NNRank and NNClass on the eight datasets. The results are the average error over 20 trials along
with the standard deviation.

              Mean zero-one error           Mean absolute error
Dataset       NNRank        NNClass         NNRank        NNClass
Stocks        12.68±1.8%    16.97±2.3%      0.127±0.01    0.173±0.02
Pyrimidines   37.71±8.1%    41.87±7.9%      0.450±0.09    0.508±0.11
Auto MPG      27.13±2.0%    28.82±2.7%      0.281±0.02    0.307±0.03
Machine       17.03±4.2%    17.80±4.4%      0.186±0.04    0.192±0.06
Abalone       21.39±0.3%    21.74±0.4%      0.226±0.01    0.232±0.01
Triazines     52.55±5.0%    52.84±5.9%      0.730±0.06    0.790±0.09
Boston        26.38±3.0%    26.62±2.7%      0.295±0.03    0.297±0.03
Diabetes      44.90±12.5%   43.84±10.0%     0.546±0.15    0.592±0.09


Table 2. Zero-one error of NNRank, SVM, GP-MAP, and GP-EP on the eight datasets. SVM denotes the support vector
machine method (Shashua & Levin, 2003; Chu & Ghahramani, 2005a). GP-MAP and GP-EP are two Gaussian process
methods using Laplace approximation (MacKay, 1992) and expectation propagation (Minka, 2001) respectively (Chu &
Ghahramani, 2005a). The results are the average error over 20 trials along with the standard deviation. An asterisk (*)
denotes the best result on each dataset.

Data          NNRank         SVM            GP-MAP         GP-EP
Triazines     52.55±5.0%*    54.19±1.5%     52.91±2.2%     52.62±2.7%
Pyrimidines   37.71±8.1%     41.46±8.5%     39.79±7.2%     36.46±6.5%*
Diabetes      44.90±12.5%*   57.31±12.1%    54.23±13.8%    54.23±13.8%
Machine       17.03±4.2%     17.37±3.6%     16.53±3.6%*    16.78±3.9%
Auto MPG      27.13±2.0%     25.73±2.2%     23.78±1.9%     23.75±1.7%*
Boston        26.38±3.0%     25.56±2.0%     24.88±2.0%     24.49±1.9%*
Stocks        12.68±1.8%     10.81±1.7%*    11.99±2.3%     12.00±2.1%
Abalone       21.39±0.3%*    21.58±0.3%     21.50±0.2%     21.56±0.4%



Table 3. Mean absolute error of NNRank, SVM, GP-MAP, and GP-EP on the eight datasets. SVM denotes the support
vector machine method (Shashua & Levin, 2003; Chu & Ghahramani, 2005a). GP-MAP and GP-EP are two Gaussian
process methods using Laplace approximation and expectation propagation respectively (Chu & Ghahramani, 2005a).
The results are the average error over 20 trials along with the standard deviation. An asterisk (*) denotes the best
result on each dataset.

Data          NNRank        SVM           GP-MAP        GP-EP
Triazines     0.730±0.07    0.698±0.03    0.687±0.02*   0.688±0.03
Pyrimidines   0.450±0.10    0.450±0.11    0.427±0.09    0.392±0.07*
Diabetes      0.546±0.15*   0.746±0.14    0.662±0.14    0.665±0.14
Machine       0.186±0.04    0.192±0.04    0.185±0.04*   0.186±0.04
Auto MPG      0.281±0.02    0.260±0.02    0.241±0.02    0.241±0.02*
Boston        0.295±0.04    0.267±0.02    0.260±0.02    0.259±0.02*
Stocks        0.127±0.02    0.108±0.02*   0.120±0.02    0.120±0.02
Abalone       0.226±0.01*   0.229±0.01    0.232±0.01    0.234±0.01



